MA304 - Exploratory Data Analysis and Data Visualisation

Introduction

In this project, I was provided, through Moodle, with a comprehensive dataset of police use-of-force incidents. It contained information such as each event’s location and time, the officer involved, and the individual they interacted with. The primary objective of this project is to carry out an exploratory analysis to uncover patterns, insights, or potential biases, particularly those related to race or gender.

To accomplish this goal, I first preprocessed the dataset to ensure it was ready for analysis. Next, I undertook a two-stage analytical process. In the first stage, I closely examined each column in isolation to identify any discernible patterns, including potential racial or gender biases. Following that, the second stage entailed analyzing multiple columns concurrently to ascertain if there were any correlations or relationships between specific variables, again with a focus on detecting any interesting insights related to race or gender.

Pre-processing of Data

The dataset for this project was found to be semi-structured. Exploring the data revealed several issues, including multiple values stored in a single cell, inconsistent date formatting, and every column being stored as character data. The following steps were taken to address these issues and transform the dataset into a format suitable for further analysis.

Libraries Used

library(tidyr)
library(tidyverse)
library(lubridate)
library(ggplot2)
library(gridExtra)
library(dplyr)
library(devtools)
library(plotly)
library(leaflet)


data = read.csv('C:/Personal Data/Essex/Term 1/MA304 - Data Visualisation/37-00049_UOF-P_2016_prepped.csv')
mod_data = data

Splitting/Deleting the Columns

## In this code, I first cleaned up the dataset by removing the first row, which holds header text rather than an incident record, and by dropping unnecessary columns. Then, I reformatted the date and time columns and split them into separate, more manageable columns. Finally, I separated columns with multiple values, like force effectiveness levels and injury types, into individual columns for easier analysis.


# Remove the first row, which holds header text rather than an incident record (rows with NA values are not dropped here)
mod_data = mod_data[-1,]
# Discard unwanted columns from the dataset
mod_data$UOF_NUMBER = NULL
mod_data$STREET_DIRECTION = NULL
mod_data$NUMBER_EC_CYCLES = NULL
# Transform INCIDENT_DATE column into a practical date format (month-day-year)
mod_data$INCIDENT_DATE = mdy(mod_data$INCIDENT_DATE)
# Split INCIDENT_DATE column into three new columns: Incident_Year, Incident_Month, and Incident_Day
mod_data = separate(mod_data,
                    col = INCIDENT_DATE,
                    into = c("Incident_Year","Incident_Month","Incident_Day"),
                    sep = "-"
                    )
# Divide INCIDENT_TIME column into two new columns: Incident_TIME and Incident_AM_PM
mod_data = separate(mod_data,
                    col = INCIDENT_TIME,
                    into = c("Incident_TIME","Incident_AM_PM"),
                    sep = " "
                    )
# Break Incident_TIME column into three new columns: Incident_Hour, Incident_Minute, and Incident_Second
mod_data = separate(mod_data,
                    col = Incident_TIME,
                    into = c("Incident_Hour","Incident_Minute","Incident_Second"),
                    sep = ":"
                    )
# Discard unneeded columns: Incident_Minute and Incident_Second
mod_data$Incident_Minute = NULL
mod_data$Incident_Second = NULL
# Change OFFICER_HIRE_DATE column into a practical date format (month-day-year)
mod_data$OFFICER_HIRE_DATE = mdy(mod_data$OFFICER_HIRE_DATE)
# Split OFFICER_HIRE_DATE column into three new columns: OFFICER_Hire_Year, OFFICER_Hire_Month, and OFFICER_Hire_Day
mod_data = separate(mod_data,
                    col = OFFICER_HIRE_DATE,
                    into = c("OFFICER_Hire_Year","OFFICER_Hire_Month","OFFICER_Hire_Day"),
                    sep = "-"
                    )
# Divide FORCE_EFFECTIVE column into several new columns, each for a distinct force effectiveness level
mod_data = separate(mod_data,
                    col = FORCE_EFFECTIVE,
                    into = c("FORCE_EFFECTIVE1","FORCE_EFFECTIVE2","FORCE_EFFECTIVE3",
                             "FORCE_EFFECTIVE4","FORCE_EFFECTIVE5","FORCE_EFFECTIVE6",
                             "FORCE_EFFECTIVE7","FORCE_EFFECTIVE8","FORCE_EFFECTIVE9",
                             "FORCE_EFFECTIVE10"
                             ),
                    sep = ","
                    )

# Divide SUBJECT_OFFENSE column into several new columns, one for each listed offense
mod_data = separate(mod_data,
                    col = SUBJECT_OFFENSE,
                    into = c("SUBJECT_OFFENSE1","SUBJECT_OFFENSE2","SUBJECT_OFFENSE3",
                             "SUBJECT_OFFENSE4","SUBJECT_OFFENSE5","SUBJECT_OFFENSE6",
                             "SUBJECT_OFFENSE7"
                             ),
                    sep = ","
                    )
# Divide SUBJECT_INJURY_TYPE column into several new columns, one for each listed subject injury
mod_data = separate(mod_data,
                    col = SUBJECT_INJURY_TYPE,
                    into = c("SUBJECT_INJURY_TYPE1","SUBJECT_INJURY_TYPE2","SUBJECT_INJURY_TYPE3",
                             "SUBJECT_INJURY_TYPE4","SUBJECT_INJURY_TYPE5","SUBJECT_INJURY_TYPE6",
                             "SUBJECT_INJURY_TYPE7","SUBJECT_INJURY_TYPE8","SUBJECT_INJURY_TYPE9",
                             "SUBJECT_INJURY_TYPE10","SUBJECT_INJURY_TYPE11","SUBJECT_INJURY_TYPE12"
                             ),
                    sep = ","
                    )
# Divide OFFICER_INJURY_TYPE column into several new columns, one for each listed officer injury
mod_data = separate(mod_data,
                    col = OFFICER_INJURY_TYPE,
                    into = c("OFFICER_INJURY_TYPE1","OFFICER_INJURY_TYPE2","OFFICER_INJURY_TYPE3",
                             "OFFICER_INJURY_TYPE4","OFFICER_INJURY_TYPE5","OFFICER_INJURY_TYPE6"
                             ),
                    sep = ","
                    )
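When a cell holds fewer values than there are target columns, the splitting above leaves the remaining columns as NA, which is why the next step replaces NA values. The following is a minimal base-R sketch of the same idea on made-up values. The `toy` vector and the `split_padded` helper are hypothetical and not part of the project code, and the sketch assumes the cells separate values with a comma followed by a space, so it also trims the pieces.

```r
# Hypothetical toy cells; the real FORCE_EFFECTIVE values may differ.
toy <- c("Yes, No, Yes", "Yes", "No, No")

# Split each cell on commas, trim spaces, and pad each result to n_cols with NA.
split_padded <- function(x, n_cols) {
  pieces <- strsplit(x, ",", fixed = TRUE)
  padded <- lapply(pieces, function(p) {
    p <- trimws(p)
    length(p) <- n_cols   # extending a vector pads it with NA
    p
  })
  out <- do.call(rbind, padded)
  colnames(out) <- paste0("FORCE_EFFECTIVE", seq_len(n_cols))
  as.data.frame(out, stringsAsFactors = FALSE)
}

toy_split <- split_padded(toy, 3)
print(toy_split)
```

Without trimming, a piece such as " No" would keep its leading space and be counted as a different category from "No" once the column is converted to a factor.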

Changing NA Values

# This code replaces the phrase "No injuries noted or visible" with "No Injury" for the columns OFFICER_INJURY_TYPE1 and SUBJECT_INJURY_TYPE1.
mod_data$OFFICER_INJURY_TYPE1[mod_data$OFFICER_INJURY_TYPE1 == "No injuries noted or visible"] = "No Injury"
mod_data$SUBJECT_INJURY_TYPE1[mod_data$SUBJECT_INJURY_TYPE1 == "No injuries noted or visible"] = "No Injury"

# This code replaces any missing (NA) values with "No Injury" for columns OFFICER_INJURY_TYPE2 to OFFICER_INJURY_TYPE6.
for (j in 2:6) {
  col_name = paste("OFFICER_INJURY_TYPE", j, sep="")
  mod_data[[col_name]][is.na(mod_data[[col_name]])] = "No Injury"
}

# This code replaces any missing (NA) values with "No Injury" for columns SUBJECT_INJURY_TYPE2 to SUBJECT_INJURY_TYPE12.
for (j in 2:12) {
  col_name = paste("SUBJECT_INJURY_TYPE", j, sep="")
  mod_data[[col_name]][is.na(mod_data[[col_name]])] = "No Injury"
}

# This code replaces any missing (NA) values with "None" for columns SUBJECT_OFFENSE2 to SUBJECT_OFFENSE7.
for (j in 2:7) {
  col_name = paste("SUBJECT_OFFENSE", j, sep="")
  mod_data[[col_name]][is.na(mod_data[[col_name]])] = "None"
}

# This code replaces any missing (NA) values with "NA" for columns FORCE_EFFECTIVE2 to FORCE_EFFECTIVE10.
for (j in 2:10) {
  col_name = paste("FORCE_EFFECTIVE", j, sep="")
  mod_data[[col_name]][is.na(mod_data[[col_name]])] = "NA"
}

# This code replaces any empty strings with "NONE" for columns TYPE_OF_FORCE_USED2 to TYPE_OF_FORCE_USED10.
for (j in 2:10) {
  col_name = paste("TYPE_OF_FORCE_USED", j, sep="")
  mod_data[[col_name]][mod_data[[col_name]] == ""] = "NONE"
}

Changing Data Types

# This code changes the data types of several columns in the mod_data dataframe.

# First, specify the columns to be converted to factor (categorical variables).
columns_to_factor = c("OFFICER_GENDER", "OFFICER_RACE", "OFFICER_INJURY", "OFFICER_INJURY_TYPE1",
                      "OFFICER_INJURY_TYPE2", "OFFICER_INJURY_TYPE3", "OFFICER_INJURY_TYPE4",
                      "OFFICER_INJURY_TYPE5", "OFFICER_INJURY_TYPE6", "OFFICER_HOSPITALIZATION", "SUBJECT_RACE",
                      "SUBJECT_GENDER", "SUBJECT_INJURY", "SUBJECT_INJURY_TYPE1", "SUBJECT_INJURY_TYPE2", "SUBJECT_INJURY_TYPE3",
                      "SUBJECT_INJURY_TYPE4", "SUBJECT_INJURY_TYPE5", "SUBJECT_INJURY_TYPE6", "SUBJECT_INJURY_TYPE7",
                      "SUBJECT_INJURY_TYPE8", "SUBJECT_INJURY_TYPE9", "SUBJECT_INJURY_TYPE10", "SUBJECT_INJURY_TYPE11",
                      "SUBJECT_INJURY_TYPE12", "SUBJECT_WAS_ARRESTED", "SUBJECT_DESCRIPTION", "DIVISION", "SUBJECT_OFFENSE1",
                      "SUBJECT_OFFENSE2", "SUBJECT_OFFENSE3", "SUBJECT_OFFENSE4", "SUBJECT_OFFENSE5", "SUBJECT_OFFENSE6",
                      "SUBJECT_OFFENSE7", "LOCATION_DISTRICT", "STREET_NAME", "STREET_TYPE", "LOCATION_CITY", "LOCATION_STATE",
                      "INCIDENT_REASON", "REASON_FOR_FORCE", "TYPE_OF_FORCE_USED1", "TYPE_OF_FORCE_USED2",
                      "TYPE_OF_FORCE_USED3", "TYPE_OF_FORCE_USED4", "TYPE_OF_FORCE_USED5",
                      "TYPE_OF_FORCE_USED6", "TYPE_OF_FORCE_USED7", "TYPE_OF_FORCE_USED8", "TYPE_OF_FORCE_USED9",
                      "TYPE_OF_FORCE_USED10", "Incident_AM_PM", "FORCE_EFFECTIVE1", "FORCE_EFFECTIVE2", "FORCE_EFFECTIVE3",
                      "FORCE_EFFECTIVE4", "FORCE_EFFECTIVE5", "FORCE_EFFECTIVE6", "FORCE_EFFECTIVE7", "FORCE_EFFECTIVE8",
                      "FORCE_EFFECTIVE9", "FORCE_EFFECTIVE10")

# Next, specify the columns to be converted to integers.
columns_to_integers = c("OFFICER_ID", "OFFICER_YEARS_ON_FORCE", "SUBJECT_ID",
                         "REPORTING_AREA", "BEAT", "SECTOR", "STREET_NUMBER",
                         "Incident_Year", "Incident_Month", "Incident_Day",
                         "Incident_Hour", "OFFICER_Hire_Year", "OFFICER_Hire_Month",
                         "OFFICER_Hire_Day")

# Lastly, specify the columns to be converted to numerics.
columns_to_numeric = c("latitude", "longitude")


# Change the column names for latitude and longitude to be lowercase.
colnames(mod_data)[colnames(mod_data) == "LOCATION_LATITUDE"] = "latitude"
colnames(mod_data)[colnames(mod_data) == "LOCATION_LONGITUDE"] = "longitude"

# Convert the specified columns to their respective data types.
mod_data[columns_to_numeric] = lapply(mod_data[columns_to_numeric], as.numeric)
mod_data[columns_to_factor] = lapply(mod_data[columns_to_factor], factor)
mod_data[columns_to_integers] = lapply(mod_data[columns_to_integers], as.integer)

# Convert month numbers to abbreviated month names
mod_data$Incident_Month = month.abb[mod_data$Incident_Month]

#The code above describes the following steps:

#1: Renaming the columns that contain latitude and longitude to lowercase.
#2: Converting the columns in the mod_data dataframe to their respective data types as specified - numeric, factor or integer.
#3: Converting the month numbers in the Incident_Month column to their abbreviated month names.
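Converting text to integers or numerics turns any entry that is not a valid number into NA, so it can be worth counting how many values a conversion would lose. The sketch below shows one way to check this on made-up values; `count_coercion_na` and `toy_years` are hypothetical names, not part of the project code.

```r
# Count entries that were present as text but become NA after conversion.
count_coercion_na <- function(x, convert = as.integer) {
  converted <- suppressWarnings(convert(x))
  sum(!is.na(x) & is.na(converted))
}

# Made-up values: "NULL" is text that cannot be read as a number.
toy_years <- c("3", "12", "NULL", NA, "7")
lost <- count_coercion_na(toy_years)
print(lost)
```

Applied to each column named in columns_to_integers before converting, a non-zero count would point to entries that need cleaning first.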

TIME ANALYSIS

At what specific times and during which months do incidents occur? Is the frequency of incidents higher in the morning or in the evening? Does the frequency of incidents change during holidays?

INCIDENTS BY YEAR

  • The first aspect that caught my attention was how the incidents are distributed throughout the year. Some months have no reported crimes, while other months have a higher-than-average rate of incidents.
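A minimal sketch of how monthly counts like these can be tabulated, using made-up month abbreviations in place of the Incident_Month column (the `toy_months` vector is hypothetical). Fixing the factor levels to `month.abb` keeps the months in calendar order and makes months with no incidents show up as zero rather than disappear.

```r
# Made-up month abbreviations standing in for mod_data$Incident_Month.
toy_months <- c("Mar", "Mar", "Feb", "Mar", "Sep", "Feb", "Jan")

# Fixed levels keep calendar order and retain months with zero incidents.
month_factor <- factor(toy_months, levels = month.abb)
monthly_counts <- table(month_factor)
print(monthly_counts)

# Month with the highest count in the toy data.
busiest_month <- month.abb[which.max(as.vector(monthly_counts))]
print(busiest_month)
```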

INCIDENTS BY MONTH

  • There are certain days of each month that appear to have a higher incidence of incidents than others, such as major holidays.

  • Analysis of the data indicates that the bulk of incidents occur in the middle of the month, with the second and third weeks seeing the highest number of incidents on average. For example, the average incident date in February falls close to Valentine’s Day, in March it falls a few days before St. Patrick’s Day, and in April it falls close to Easter.
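The two summaries mentioned above, the average day of the month and the week of the month, can be computed along these lines. This is a sketch on made-up rows that reuses the Incident_Month and Incident_Day column names; the `toy` data frame and the week definition (days 1 to 7 are week 1, and so on) are assumptions made for illustration.

```r
# Made-up rows standing in for the month and day columns of mod_data.
toy <- data.frame(
  Incident_Month = c("Feb", "Feb", "Feb", "Mar", "Mar"),
  Incident_Day   = c(10L, 14L, 18L, 12L, 16L),
  stringsAsFactors = FALSE
)

# Average day of the month on which incidents fall, per month.
mean_day <- tapply(toy$Incident_Day, toy$Incident_Month, mean)
print(mean_day)

# Week of the month: days 1-7 are week 1, days 8-14 are week 2, and so on.
toy$week_of_month <- (toy$Incident_Day - 1L) %/% 7L + 1L
week_counts <- table(toy$week_of_month)
print(week_counts)
```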

INCIDENTS BY DAY

  • Every incident is recorded as occurring in either the AM or the PM. My hypothesis was that the majority of incidents occur in the evening, since visibility is reduced during that time.

  • The jitter plot displaying incidents by AM or PM clearly illustrates that most incidents happen in the PM, which supports my hypothesis. Another noteworthy finding is that the lowest incidence rates are observed between 4 and 10 in the morning, perhaps because most individuals are asleep during that period.
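Because the hour and the AM/PM marker are stored in separate columns, comparing hours across the whole day is easier on a 24-hour scale. A minimal sketch of that conversion, assuming the hour is an integer from 1 to 12 and the marker is "AM" or "PM" (the `toy` data frame is made up):

```r
# Made-up rows standing in for Incident_Hour and Incident_AM_PM.
toy <- data.frame(
  Incident_Hour  = c(12L, 1L, 11L, 12L, 5L),
  Incident_AM_PM = c("AM", "AM", "PM", "PM", "PM"),
  stringsAsFactors = FALSE
)

# 12 AM maps to hour 0 and 12 PM maps to hour 12; other PM hours gain 12.
toy$hour24 <- (toy$Incident_Hour %% 12L) +
  ifelse(toy$Incident_AM_PM == "PM", 12L, 0L)

# Counts for every hour of the day, including hours with no incidents.
hourly_counts <- table(factor(toy$hour24, levels = 0:23))
print(hourly_counts)
```

The modulo step matters because 12 AM and 12 PM would otherwise be placed at the wrong end of their half of the day.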

OFFICER and SUBJECT ANALYSIS

There are several potential queries concerning the officers and incidents.

What is the total number of male and female officers?

Which racial groups constitute the majority of officers and subjects involved in incidents?

What is the tenure of the officers in uniform?

What types of injuries do subjects and officers sustain in incidents?

Are these injuries requiring hospitalization?

What percentage of the subjects involved in incidents are ultimately detained?

What are the top ten reasons for arrests?

Years of Service as an Officer

In a profession like law enforcement, experience is crucial. The duration of an officer’s service in the police force is significant information for further analysis. An officer with more experience is likely to be better equipped to handle various situations.

In my findings, I compared the years of service based on race and gender.

YEARS IN SERVICE (GENDER)

The histograms below reveal that the majority of male and female police officers have fewer than ten years of experience, so the distribution is skewed to the right: most values are low, with a long tail that reaches past 30 years. Officers with more than ten years of service are comparatively few. Since the majority of officers have less than a decade of experience, retaining more experienced officers is valuable, because newer officers can learn from them.
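The direction of skew can be checked numerically as well as visually. The sketch below uses made-up years of service and a hypothetical `sample_skewness` helper based on the third central moment; a positive value, together with a mean above the median, indicates a long tail toward higher values.

```r
# Made-up years of service: many short tenures and a few long ones.
toy_years <- c(1, 2, 2, 3, 3, 4, 5, 6, 8, 12, 20, 31)

# Moment-based skewness: positive means the long tail is on the right.
sample_skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

skew <- sample_skewness(toy_years)
right_skewed <- skew > 0 && mean(toy_years) > median(toy_years)
print(skew)
print(right_skewed)
```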

YEARS IN FORCE (RACE)

The histograms show a rightward skew for every racial group, indicating that most officers of all races have less than a decade of experience. However, an interesting discovery is that there were few recorded incidents for American Indian officers who had been in service for 10 or more years and Asian officers who had been serving for 20 or more years. This observation is thought-provoking. It may suggest that officers from certain racial groups exhibit more self-restraint or experience fewer incidents over time, although it may also simply reflect how few such officers there are.

OFFICER AND SUBJECT GENDER

  • It is evident that there is a significant gender disparity between police officers and the individuals they interact with. The majority of officers are male, while female officers are a minority, with fewer than 500 recorded. Interestingly, female subjects in police encounters outnumber female officers by a significant margin. This discrepancy raises concerns about the representation of women in law enforcement and highlights the need for more efforts to increase gender diversity in the field. Additionally, it suggests that there may be gender-related biases or issues in policing practices that need to be addressed to ensure equal treatment for all individuals involved in police encounters.
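One point worth checking when counting officers is whether a figure counts incident rows or distinct officers. Assuming each row of the dataset is one use-of-force record, the same officer can appear many times. The sketch below, on made-up rows that reuse the OFFICER_ID and OFFICER_GENDER column names, shows both counts side by side.

```r
# Made-up rows: officer 101 appears twice and officer 103 three times.
toy <- data.frame(
  OFFICER_ID     = c(101L, 101L, 102L, 103L, 103L, 103L),
  OFFICER_GENDER = c("Male", "Male", "Female", "Male", "Male", "Male"),
  stringsAsFactors = FALSE
)

# One count per incident row.
incident_counts <- table(toy$OFFICER_GENDER)

# One count per distinct officer: keep the first row for each OFFICER_ID.
officers <- toy[!duplicated(toy$OFFICER_ID), ]
officer_counts <- table(officers$OFFICER_GENDER)

print(incident_counts)
print(officer_counts)
```

The two tables answer different questions: the first describes how often officers of each gender are involved in incidents, the second how many officers of each gender appear in the data.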

OFFICER AND SUBJECT RACE

The data indicates a clear and significant discrepancy between the racial makeup of police officers and that of the individuals they encounter. White officers constitute the largest racial group among officers, followed by Hispanic and Black officers, whereas officers from other racial groups are notably underrepresented. Among subjects, Black individuals are the most frequently encountered group, followed by Hispanic and then White individuals, which contrasts with the racial distribution of officers.

These findings suggest that there is an ongoing disconnect between the racial diversity of law enforcement officers and the individuals they interact with. The dominance of White officers in policing raises questions about the potential impact of their racial identity on their interactions with individuals from diverse racial backgrounds. The underrepresentation of minority groups among law enforcement officers, particularly Hispanics, further highlights the need for efforts to increase diversity in policing.
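A comparison like this is easiest to read when officer and subject shares are placed in one table over the same set of race categories. A minimal sketch on made-up rows (the values in `toy` are illustrative only and do not reproduce the dataset's proportions):

```r
# Made-up rows reusing the OFFICER_RACE and SUBJECT_RACE column names.
toy <- data.frame(
  OFFICER_RACE = c("White", "White", "Hispanic", "Black", "White"),
  SUBJECT_RACE = c("Black", "Black", "Hispanic", "White", "Black"),
  stringsAsFactors = FALSE
)

# Use one shared set of categories so the two rows line up.
races <- sort(unique(c(toy$OFFICER_RACE, toy$SUBJECT_RACE)))
officer_share <- prop.table(table(factor(toy$OFFICER_RACE, levels = races)))
subject_share <- prop.table(table(factor(toy$SUBJECT_RACE, levels = races)))

shares <- rbind(officer = as.vector(officer_share),
                subject = as.vector(subject_share))
colnames(shares) <- races
print(round(shares, 2))
```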

Injuries to officers and subjects:

Injury Statistics by Gender for Officers and Subjects

INJURIES TO OFFICERS AND SUBJECTS (RACE)

When we look at the data on injuries by race, it is clear that Black subjects are injured more often than subjects of other races, and that White officers are the officers injured most frequently. It is important to remember, however, that most people, whether officers or subjects, are not injured at all, as shown in the data.
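Raw injury counts depend on how many encounters each group has, so an injury rate per group can be a useful complement. The sketch below computes both on made-up rows; it assumes SUBJECT_INJURY holds "Yes" or "No", which should be checked against the actual column.

```r
# Made-up rows: both groups have one injury, but different numbers of encounters.
toy <- data.frame(
  SUBJECT_RACE   = c("Black", "Black", "Black", "Black", "White", "White"),
  SUBJECT_INJURY = c("Yes", "No", "No", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

injured <- toy$SUBJECT_INJURY == "Yes"

# Number of injured subjects per group.
injury_counts <- tapply(injured, toy$SUBJECT_RACE, sum)

# Share of each group's encounters that involved an injury.
injury_rates <- tapply(injured, toy$SUBJECT_RACE, mean)

print(injury_counts)
print(injury_rates)
```

In this toy data the counts are equal while the rates differ, which is the situation where reporting only counts could mislead.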

The Top 5 Officer and Subject Injuries

Subjects and officers sustain similar injuries, which suggests that physical struggles injure both parties. One distinction is that Taser burn marks, an injury caused by one of the non-lethal devices an officer may carry, appear on subjects. Even so, most people involved in an incident are not injured.
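A top-five list such as this one can be produced by counting values, leaving out the "No Injury" placeholder introduced during pre-processing, and keeping the most frequent entries. The helper `top_n_values` and the injury labels below are hypothetical and serve only to illustrate the approach.

```r
# Return the n most frequent values of x, ignoring NA and any excluded labels.
top_n_values <- function(x, n = 5, exclude = character(0)) {
  x <- as.character(x)
  x <- x[!is.na(x) & !(x %in% exclude)]
  head(sort(table(x), decreasing = TRUE), n)
}

# Made-up injury labels; the real categories may be named differently.
toy_injuries <- c("No Injury", "Abrasion/Scrape", "No Injury", "Taser Burn Marks",
                  "Abrasion/Scrape", "Laceration/Cut", "No Injury")

top_injuries <- top_n_values(toy_injuries, n = 2, exclude = "No Injury")
print(top_injuries)
```

The same helper could be reused for the other top-five and top-ten summaries in this report by changing the column and the excluded placeholder.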

HOSPITALISATION OF OFFICERS

  • Based on the bar graph below, it is clear that the majority of police officers who were injured did not sustain significant injuries, as only a small percentage of officers required hospitalization. This finding suggests that while injuries do occur among police officers, they are generally not severe enough to require hospitalization.

WAS THE SUBJECT ARRESTED?

  • Most of the subjects were taken into custody.

DESCRIPTION OF THE SUBJECT

  • Subjects were most often described as mentally unstable, followed by being under the influence of alcohol and of other drugs.

TOP 10 SUBJECT OFFENSES

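  • The most common offence recorded was APOWW (apprehension by a peace officer without a warrant, an emergency mental-health detention in Texas), which is consistent with mental instability being the most common subject description.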

TOP 5 INCIDENT REASON AND TOP 5 REASONS FOR FORCE

The top five reasons for force and the top five incident reasons are very similar and go hand in hand. The reasons given for using force appear consistent with the incidents, and the data does not suggest that disproportionate force was used.

THE TOP 5 FORCES USED EFFECTIVELY

  • When the top five types of force were analysed, verbal command, which ranked first by frequency, proved to be the least effective, while the other four were highly successful. Given that verbal commands are the most common and least effective, it is possible that they are standard operating practice.
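Effectiveness per force type can be estimated by pairing each TYPE_OF_FORCE_USED column with the FORCE_EFFECTIVE column of the same number. The sketch below does this for two column pairs on made-up rows. It assumes that the k-th effectiveness entry refers to the k-th force type and that effectiveness is recorded as "Yes" or "No", neither of which is confirmed here.

```r
# Made-up rows with two force columns and their matching effectiveness flags.
toy <- data.frame(
  TYPE_OF_FORCE_USED1 = c("Verbal Command", "Verbal Command", "Taser"),
  TYPE_OF_FORCE_USED2 = c("Held Suspect Down", "NONE", "Verbal Command"),
  FORCE_EFFECTIVE1    = c("No", "No", "Yes"),
  FORCE_EFFECTIVE2    = c("Yes", "NA", "Yes"),
  stringsAsFactors = FALSE
)

# Stack the k-th force type with the k-th effectiveness flag into long form.
long <- do.call(rbind, lapply(1:2, function(k) {
  data.frame(
    force     = trimws(toy[[paste0("TYPE_OF_FORCE_USED", k)]]),
    effective = trimws(toy[[paste0("FORCE_EFFECTIVE", k)]]),
    stringsAsFactors = FALSE
  )
}))

# Keep only pairs with a real outcome, dropping the "NONE"/"NA" placeholders.
long <- long[long$effective %in% c("Yes", "No"), ]

# Share of uses of each force type that were recorded as effective.
effect_rate <- tapply(long$effective == "Yes", long$force, mean)
print(effect_rate)
```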

ASSESSMENT OF LOCATION

  • Location analysis is crucial for determining which parts of the city require increased surveillance and which experience few incidents.

  • Which types of street saw the most incidents?

  • Which streets are the scene of the most incidents?

  • A map showing the top five offences.

STREETS WITH THE HIGHEST INCIDENT RATE

  • Based on the analysis of the streets, one sensible course of action would be to place more officers on those streets to ensure the safety of residents.

MAP OF THE TOP 5 CRIMES IN DALLAS, TEXAS

  • It is evident from the map below that incidents are dispersed throughout the city of Dallas rather than confined to one location. One finding is that incidents are more concentrated in Downtown Dallas than in other places.

  • Black = APOWW (Apprehension by Peace Officer Without Warrant)

  • Red = Public Intoxication

  • Blue = Assault/FV

  • Orange = Warrant/Hold

  • Pink = Evading Arrest
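A colour-coded map of this kind can be sketched as follows. The colours follow the legend above, while the coordinates and the exact offence labels in `toy` are made up and may not match the values stored in the dataset. The map itself is only built if the leaflet package is installed.

```r
# Colour lookup taken from the legend; the labels are assumed spellings.
offence_colours <- c("APOWW" = "black", "Public Intoxication" = "red",
                     "Assault/FV" = "blue", "Warrant/Hold" = "orange",
                     "Evading Arrest" = "pink")

# Made-up incidents with approximate Dallas coordinates.
toy <- data.frame(
  SUBJECT_OFFENSE1 = c("APOWW", "Assault/FV", "Evading Arrest"),
  latitude  = c(32.78, 32.80, 32.75),
  longitude = c(-96.80, -96.77, -96.82),
  stringsAsFactors = FALSE
)

# Look up the marker colour for each incident by its offence.
toy$colour <- unname(offence_colours[toy$SUBJECT_OFFENSE1])

if (requireNamespace("leaflet", quietly = TRUE)) {
  m <- leaflet::leaflet(toy)
  m <- leaflet::addTiles(m)
  m <- leaflet::addCircleMarkers(m, lng = ~longitude, lat = ~latitude,
                                 color = ~colour, radius = 4)
  m <- leaflet::addLegend(m, colors = unname(offence_colours),
                          labels = names(offence_colours))
  print(m)
}
```

Rows with missing coordinates would need to be filtered out before plotting the real data.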

CONCLUSION

After analyzing the data, there are several interesting insights that can be derived. One such finding is that there is a significant disparity between the gender and racial makeup of law enforcement officers and the individuals they interact with. This raises concerns about the potential impact of identity-based biases on the interactions between police and individuals from diverse backgrounds.

Another intriguing observation is that despite the common perception that December would have the highest crime rate due to the holiday season, December had the lowest incidence of crimes. On the other hand, March had the highest incidence of crimes, with a gradual decrease until September, when crime rates start to increase again. The reasons for these trends require further investigation.

Another notable finding is that American Indian officers with 10 or more years of experience and Asian officers with 20 or more years of experience had very few incidents recorded. This observation is intriguing and suggests that officers from certain racial groups may exhibit more restraint or have fewer incidents over time.

Regarding injuries, when broken down by race, Black subjects suffered the most injuries among subjects, while White officers were the officers injured most frequently. Despite this, the majority of both officers and subjects were not injured in most incidents.

The most common subject description was mental instability, followed by alcohol and other drugs. Among the top five types of force used, verbal commands were the most common but the least effective, suggesting a potential need for changes in police training and standard operating procedures.

Finally, the map of the top five crimes in Dallas, Texas, revealed that incidents were dispersed throughout the city, with a higher concentration in Downtown Dallas. This suggests a potential need for targeted policing strategies in high-crime areas.

Overall, this investigation represents a significant step forward in understanding the complex dynamics of crime and law enforcement in Dallas. By shedding light on these issues, we can work towards a more just and equitable society for all.